DAOS-18388 client: handle timeout for CRT_OPC_PROTO_QUERY RPC #17335
Conversation
Ticket title is 'daos_rpc_proto_query() crt_proto_query()failed: DER_TIMEDOUT(-1011): 'Time out''
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/2/execution/node/451/log
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/2/execution/node/466/log
Force-pushed from b408c13 to 0863dd6
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17335/5/testReport/
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17335/5/execution/node/1176/log
Force-pushed from bd43c12 to bc0c86f
knard38 left a comment:
LGTM
src/client/api/rpc.c (outdated diff)
    /* More retry to the first timeout rank with default timeout. */
    rank = rproto->first_timeout_rank;
    rproto->timeout = 0;
Shouldn't this be rproto->timeout = timeout; in this case? The timeout queried at line 137 will be the 'default timeout' that you mention on line 151.
We use rproto->timeout as a flag to indicate that we have already retried, as at line 142. The related cart-level logic will automatically set the new RPC timeout to the default timeout.
Not sure why you need to treat 0 here as a special value.
If you set rproto->timeout = timeout, the logic on line 142 will still trigger on the next iteration via the (timeout > 0 && timeout <= rproto->timeout) part.
My concern is that setting it to 0 can lead to issues if someone later decides to do, for example, '+3': instead of 'default timeout' + 3 you would end up with a timeout of just 3 seconds.
OK, I see your concern. I will refresh the patch.
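For illustration, here is a minimal sketch of the two variants debated above. The struct layout, helper name, and default_timeout parameter are hypothetical; only the field names first_timeout_rank and timeout come from the visible diff fragment, and this is not the actual src/client/api/rpc.c code.

```c
#include <stdbool.h>

/* Hypothetical container for the query state. */
struct rpc_proto {
	int  rank;                /* rank currently being queried */
	int  first_timeout_rank;  /* first rank that reported a timeout */
	int  timeout;             /* per-RPC timeout in seconds */
	bool retried;             /* explicit "already retried" flag */
};

static void
proto_query_retry_first_timeout(struct rpc_proto *rproto, int default_timeout)
{
	/* One more retry against the first rank that timed out. */
	rproto->rank = rproto->first_timeout_rank;

	/*
	 * Variant A (the patch): rproto->timeout = 0 acts as a sentinel,
	 * letting the cart layer fall back to its default RPC timeout.
	 *
	 * Variant B (the reviewer's suggestion, shown here): carry the
	 * queried default timeout forward so later arithmetic such as
	 * "timeout + 3" still starts from a sane base value, and track
	 * the retry with an explicit flag instead of a magic 0.
	 */
	rproto->timeout = default_timeout;
	rproto->retried = true;
}
```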
Ping reviewers, thanks!
Currently, the initial timeout for the CRT_OPC_PROTO_QUERY RPC is only 3 seconds, which helps to get going more quickly when some rank(s) are down. But that increases the risk of the query failing with a timeout if there are only a few targets in the system and they happen to be busy or not ready in time when queried. The patch adds one more CRT_OPC_PROTO_QUERY RPC retry against the rank that has reported an RPC timeout. Such a retry uses the default RPC timeout configuration instead of the initial small value. Signed-off-by: Fan Yong <[email protected]>
Force-pushed from bc0c86f to 379e8ae
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17335/10/testReport/
Let's ask @mchaarawi to have a look at this patch. Thanks.
So the original implementation was endlessly looping, each time increasing the timeout. Then @frostedcmos updated the implementation to just query and fail after all ranks have been tried. The problem with this PR is that it takes the first rank and just retries with the default timeout for it. So if that rank is truly down and not just busy, we are not resolving the actual problem reported by the ticket.
One of the observed CI failure instances had only one target in the system when the test logic issued the call. I am not sure whether it is because the control plane returned success too early, before the server-side network was ready. Another failure instance had four targets in the system, all of them alive; but during the four queries, all four targets kept reporting similar messages for about 30 seconds, and during that time window all the query RPCs on the client side timed out. As for the solution, this patch makes one additional query RPC to the first timed-out target with the default timeout. According to the instances we observed, that is enough to resolve the CI failure. But as @mchaarawi pointed out, in a real environment, if the first bad target is really dead, whether this patch can work depends on how the query RPC returns failure. Anyway, it is fine for me to increase the initial timeout from 3 seconds to some larger value, but I do not know how large a good candidate would be.
Well, in this case the issue is in CI / system tests, and it looks like we have to either wait for the server to fully start before issuing the RPC (pool create or cont create or whatever), or have the test / user retry the operation in case we get a timeout. Alternatively, we can just add a testing env variable that increases the proto query timeout to something like 30 seconds. So by default we use 3 seconds, but for those failing test cases in CI we set that env variable to some larger value.
This 3-second timeout is not configurable via the environment; instead, it is hard-coded inside the logic: `int ...
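For context, a hard-coded per-RPC timeout of this kind would typically be applied through crt_req_set_timeout(), the cart API for overriding an individual RPC's timeout. The constant name and the wrapper below are hypothetical, not the actual rpc.c code.

```c
#include <cart/api.h>

/* Hypothetical constant; the real 3-second value is hard-coded in the
 * proto query logic rather than read from the environment. */
#define PROTO_QUERY_INIT_TIMEOUT_SEC	3

static int
proto_query_set_initial_timeout(crt_rpc_t *rpc)
{
	/* Use a short per-RPC timeout so a dead rank is skipped quickly;
	 * not calling crt_req_set_timeout() at all would leave the cart
	 * default timeout in effect. */
	return crt_req_set_timeout(rpc, PROTO_QUERY_INIT_TIMEOUT_SEC);
}
```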
I am not sure whether it is easy for the control plane (or the test logic) to exactly detect such a ready status. If it is not easy, or needs more time to wait, then the current ...
To me it's not a great idea to add core code and make it more complicated just to get around test issues.
Right, so I meant we make it configurable or actually retry the command. Do you mind if I assign this issue to myself?
@Nasf-Fan For a quick fix, I would suggest just bumping the 3-second hardcoded timeout to 30 (but keep the increment by 3 on the next rank if it times out). This should be more than enough on a smaller system of 1 or 2 ranks to get a response. I will work on a more generic approach later. Does that sound reasonable to you @Nasf-Fan?
Something to keep in mind with this: if the first RPC is sent prematurely, just before an engine is fully up, then bumping a timeout to 30 seconds can make everything delayed by 30 seconds, as the RPC would need to be resent (on some providers). As such, it is preferable to start with smaller timeout values and gradually bump them up. On the other hand, on some other providers an attempt to reach an engine that is not up can result in an immediate failure, without the full timeout expiring. In such cases a client that starts prematurely can end up cycling through all engines in no time. A different approach might be to keep something like 'total_wait_time' in the proto query and exit the retry loop when the wait time exceeds some threshold (perhaps env configurable), while keeping individual proto query RPC timeouts relatively short. The wait time would then need to be calculated as actual time spent on the client, instead of relying on a timeout value. My thinking here is that if we can't get the proto query to succeed with a 5-10 second timeout after repeated retries to different engines, then we have some serious issues already and shouldn't continue, as it would fail anyway and make further debugging more complicated.
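A rough sketch of that 'total_wait_time' idea, under assumed names (the PROTO_QUERY_MAX_WAIT env variable and the helpers below are hypothetical, not existing DAOS knobs): keep each proto query RPC timeout short, but measure real wall-clock time on the client and stop retrying once a configurable cap is exceeded.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

/* Monotonic wall-clock time in seconds, immune to timeout bookkeeping. */
static double
now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Decide whether to keep cycling through engines based on the actual
 * time spent so far, not on the sum of per-RPC timeout values. */
static bool
proto_query_should_retry(double start_time)
{
	const char *env = getenv("PROTO_QUERY_MAX_WAIT"); /* hypothetical */
	double	    max_wait = env != NULL ? atof(env) : 60.0;

	return now_sec() - start_time < max_wait;
}
```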
I did not know who was the right person to handle such proto query issues, so I made the workaround patch by myself. Anyway, I will assign DAOS-18388 to you @mchaarawi.
Yes, that is fine really. In production this will not happen. In CI, it means we will have a 30-second timeout, but by that time the engines should be up (on those 1 or 2 engine systems that seem to be the ones failing this test).
But what is this total wait time supposed to be? Originally we thought just retrying forever was fine. I think this is really, again, an unnecessary complication.
You are correct, if we have a normal system with several engines. The problem here is that there are some tests with just 1 or 2 engines. I think just bumping the initial timeout to 30 seconds should be OK; maybe 30 is overkill and 10 is more appropriate. Originally it was 60, but the complaint was that we were always querying dead engines in many cases. But as I said, that issue was fixed.
If bumping the timeout is acceptable, then my current patch may be better, since it does not increase the wait time for most normal cases without a real RPC timeout; that is equal to the case without my patch. But if all engines are not ready in time (such as in the CI test failure cases), then my patch introduces at most 60 seconds of additional wait.
The reason I'm pushing back on your patch is that it makes this proto query unnecessarily more complicated for a non-production use case, just for CI.
Bumping the initial timeout will not only affect CI; won't it also make a real production system wait longer?
I just explained that in the previous comment; and we can bump the timeout to maybe something like 10/20 seconds, not the default of 60 seconds.
OK, I will close this one, and assign it to you for a more general solution.
Currently, the initial timeout for the CRT_OPC_PROTO_QUERY RPC is only 3 seconds, which helps to get going more quickly when some rank(s) are down. But that increases the risk of the query failing with a timeout if there are only a few targets in the system and they happen to be busy or not ready in time when queried.
The patch adds one more CRT_OPC_PROTO_QUERY RPC retry against the rank that has reported an RPC timeout. Such a retry uses the default RPC timeout configuration instead of the initial small value.
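To make the described flow concrete, here is a simplified sketch. The send_proto_query() helper and the overall structure are hypothetical, not the actual daos_rpc_proto_query() implementation; only the behaviour follows the description above: each rank is first queried with the small initial timeout, and if all of them fail, one extra query goes to the first rank that timed out, this time with the default RPC timeout.

```c
#include <daos_errno.h>

/* Hypothetical helper: sends CRT_OPC_PROTO_QUERY to one rank; a timeout
 * of 0 means "use the default RPC timeout". */
int send_proto_query(int rank, int timeout_sec);

static int
proto_query_with_retry(int nr_ranks, int init_timeout)
{
	int first_timeout_rank = -1;
	int rank;
	int rc = -DER_NONEXIST;

	/* First pass: short timeout so dead or busy ranks are skipped fast. */
	for (rank = 0; rank < nr_ranks; rank++) {
		rc = send_proto_query(rank, init_timeout);
		if (rc == 0)
			return 0;
		if (rc == -DER_TIMEDOUT && first_timeout_rank < 0)
			first_timeout_rank = rank;
	}

	/* One extra retry against the first timed-out rank, this time
	 * letting the default (larger) RPC timeout apply. */
	if (first_timeout_rank >= 0)
		rc = send_proto_query(first_timeout_rank, 0);

	return rc;
}
```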